# Multimodal Model

## Spaceom GGUF
**Author:** mgonzs13 · **License:** Apache-2.0 · **Tags:** Text-to-Image, English · **Downloads:** 196 · **Likes:** 1

SpaceOm-GGUF is a multimodal model focused on visual question answering, with particularly strong spatial reasoning.
## Qwen2 VL 7B Captioner Relaxed GGUF
**Author:** r3b31 · **License:** Apache-2.0 · **Tags:** Image-to-Text, English · **Downloads:** 321 · **Likes:** 1

A GGUF-format conversion of Qwen2-VL-7B-Captioner-Relaxed, optimized for image-to-text tasks and runnable with tools such as llama.cpp and Koboldcpp.
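GGUF builds like the one above are typically driven through llama.cpp's multimodal command-line tool. A minimal sketch of assembling such an invocation, assuming current llama.cpp conventions (a `llama-mtmd-cli` binary that takes a separate `--mmproj` projector file); all file names are hypothetical placeholders:

```python
# Hedged sketch: build the argv list for a one-shot llama.cpp multimodal
# generation. Flag names follow llama.cpp's llama-mtmd-cli; adjust for
# your build if the tool or flags differ.
import subprocess

def build_llama_cpp_cmd(model_path, mmproj_path, image_path, prompt):
    """Return the argv list for a one-shot multimodal generation."""
    return [
        "llama-mtmd-cli",
        "-m", model_path,         # main language-model GGUF
        "--mmproj", mmproj_path,  # vision projector GGUF shipped alongside
        "--image", image_path,
        "-p", prompt,
    ]

cmd = build_llama_cpp_cmd(
    "qwen2-vl-7b-captioner-relaxed.Q4_K_M.gguf",  # placeholder file name
    "mmproj-qwen2-vl-7b.gguf",                    # placeholder projector
    "photo.jpg",
    "Describe this image in detail.",
)
# subprocess.run(cmd, check=True)  # uncomment once the files exist locally
print(" ".join(cmd))
```

Keeping the command as a list (rather than a shell string) avoids quoting issues when the prompt contains spaces or special characters.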
## Vit GPT2 Image Captioning
**Author:** motheecreator · **Tags:** Image-to-Text, Transformers · **Downloads:** 149 · **Likes:** 0

An image captioning model based on the ViT-GPT2 architecture, capable of generating natural language descriptions for input images.
## Vit GPT2 Image Captioning
**Author:** mo-thecreator · **Tags:** Image-to-Text, Transformers · **Downloads:** 17 · **Likes:** 0

An image captioning model based on the ViT-GPT2 architecture, capable of generating natural language descriptions for input images.
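Several entries in this listing (the ViT-GPT2 captioners, Swin AraGPT2, Vitgpt2 Vizwiz) follow the transformers VisionEncoderDecoder pattern, which is normally driven through the `image-to-text` pipeline. A minimal sketch, assuming transformers and torch are installed; the repo id is a placeholder to be replaced with the actual checkpoint:

```python
def load_captioner(model_id):
    """Build an image-to-text pipeline for a VisionEncoderDecoder checkpoint.

    model_id is a placeholder; substitute the Hub repo id of the ViT-GPT2
    checkpoint you want (e.g. one of the captioning models listed here).
    """
    from transformers import pipeline  # heavy dependency, imported lazily
    return pipeline("image-to-text", model=model_id)

def extract_caption(pipeline_output):
    # image-to-text pipelines return a list like [{"generated_text": "..."}]
    return pipeline_output[0]["generated_text"].strip()

# Usage (downloads weights, so not run here):
#   captioner = load_captioner("<author>/<vit-gpt2-checkpoint>")
#   print(extract_caption(captioner("photo.jpg")))
```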
## Florence 2 Large TableDetection
**Author:** ucsahin · **License:** MIT · **Tags:** Image-to-Text, Transformers · **Downloads:** 1,993 · **Likes:** 18

A multimodal table detection model fine-tuned from Florence-2, capable of precisely locating table regions in images.
## Paligemma Vqav2
**Author:** merve · **Tags:** Text-to-Image, Transformers · **Downloads:** 168 · **Likes:** 13

A fine-tuned version of google/paligemma-3b-pt-224 on a subset of the VQAv2 dataset, specializing in visual question answering.
## Chexagent 2 3b
**Author:** StanfordAIMI · **Tags:** Image-to-Text, Transformers, Other · **Downloads:** 28.72k · **Likes:** 4

CheXagent is a foundation model focused on chest X-ray interpretation, designed to assist medical imaging analysis.
## Vit Base Patch16 224 Turkish Gpt2 Medium
**Author:** atasoglu · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Other · **Downloads:** 14 · **Likes:** 0

A vision encoder-decoder model combining ViT with Turkish GPT-2 to generate Turkish image captions.
## Xrayclip Vit L 14 Laion2b S32b B82k
**Author:** StanfordAIMI · **Tags:** Image-to-Text, Transformers · **Downloads:** 975 · **Likes:** 0

CheXagent is a foundation model specifically designed for chest X-ray interpretation, capable of automatically analyzing and interpreting chest X-ray images.
## Chartllama 13b
**Author:** listen2you002 · **License:** Apache-2.0 · **Tags:** Large Language Model, Transformers, English · **Downloads:** 144 · **Likes:** 19

ChartLlama is a multimodal model based on the LLaVA-1.5 architecture, specializing in chart understanding and analysis.
## Blip Image Captioning Base Test Sagemaker Tops 3
**Author:** GHonem · **License:** BSD-3-Clause · **Tags:** Image-to-Text, Transformers · **Downloads:** 13 · **Likes:** 0

A fine-tuned version of Salesforce's BLIP image-captioning base model trained on the SageMaker platform, primarily used for image caption generation.
## Swin Aragpt2 Image Captioning V3
**Author:** AsmaMassad · **Tags:** Image-to-Text, Transformers · **Downloads:** 18 · **Likes:** 0

An image captioning model based on the Swin Transformer and AraGPT2 architectures, capable of generating textual descriptions for input images.
## Saved Model Git Base
**Author:** holipori · **License:** MIT · **Tags:** Image-to-Text, Transformers, Other · **Downloads:** 13 · **Likes:** 0

A vision-language model fine-tuned from microsoft/git-base on an image-folder dataset, primarily used for image caption generation.
## Blip2 Flan T5 Xl Sharded
**Author:** ethzanalytics · **License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Downloads:** 71 · **Likes:** 6

A sharded version of BLIP-2 with a Flan-T5-XL language model, for image-to-text tasks such as image captioning and visual question answering. Sharding allows the checkpoint to be loaded in low-memory environments.
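The point of a sharded checkpoint is that weights can be pulled in piece by piece instead of materializing the whole model in RAM at once. A hedged sketch of low-memory loading with transformers (model id is a placeholder; assumes torch, transformers, and accelerate are installed):

```python
def load_blip2_low_memory(model_id):
    """Load a sharded BLIP-2 checkpoint without holding all weights in RAM.

    model_id is a placeholder for the sharded repo. device_map="auto"
    (provided by accelerate) dispatches shards across available devices
    as they are read; float16 halves the memory footprint.
    """
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )
    return processor, model

# Usage (downloads several GB, so not run here):
#   processor, model = load_blip2_low_memory("<author>/blip2-flan-t5-xl-sharded")
```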
## Image Caption
**Author:** jaimin · **License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers · **Downloads:** 14 · **Likes:** 2

An image caption generation model based on the VisionEncoderDecoder architecture, converting input images into natural language descriptions.
## Clip Vit Large Patch14 Ko
**Author:** Bingsu · **License:** MIT · **Tags:** Text-to-Image, Transformers, Korean · **Downloads:** 4,537 · **Likes:** 15

A Korean CLIP model trained via knowledge distillation, supporting multimodal understanding in Korean and English.
## Layoutlmv3 Finetuned Wildreceipt
**Author:** Theivaprakasham · **Tags:** Text Recognition, Transformers · **Downloads:** 118 · **Likes:** 3

A LayoutLMv3-base model fine-tuned on the WildReceipt dataset, designed for key-information extraction from receipts.
## Vitgpt2 Vizwiz
**Author:** gagan3012 · **Tags:** Image-to-Text, Transformers · **Downloads:** 24 · **Likes:** 1

A vision-language model based on the ViT-GPT2 architecture for image-to-text tasks.
---

*Featured Recommended AI Models · AIbase — Empowering the Future, Your AI Solution Knowledge Base · © 2025 AIbase*